Model of Wikipedia growth based on information exchange via reciprocal arcs 
We show how reciprocal arcs significantly influence the structural organization of Wikipedias,
online encyclopedias. We show that the random addition of reciprocal arcs to a static network
cannot explain the observed reciprocity of Wikipedias. We present a model of Wikipedia growth based
on preferential attachment and on information exchange via reciprocal arcs. Excellent agreement
between the in-degree distributions of our model and those of real Wikipedia networks is achieved
without fitting the distributions, merely by extracting a small number of model parameters from
measurements of real networks.
I. INTRODUCTION

Wikipedias have recently become a vibrant interdisciplinary field of study. The
unique character of the free article-editing policy and the large number of people
participating in the process make Wikipedia an excellent model system for
investigating ideas about complex systems in the realistic environment of a real
social structure. Indeed, in the last few years a growing amount of evidence has
appeared supporting the use of ideas from statistical physics, graph theory, etc.
in the description of social and economic phenomena. This especially applies to
phenomena which previously seemed untouchable from the natural scientist's point
of view.

One of the very interesting features previously observed in Wikipedias is their
reciprocity [2]. Reciprocal arcs are arcs pointing from a vertex $i$ to a vertex
$j$ for which there also exists an arc pointing from vertex $j$ to vertex $i$.
The reciprocity is then defined as the fraction of reciprocal arcs in the total
number of arcs, $r = L^{\leftrightarrow}/L$. It was previously shown that
reciprocal arcs can have an interesting role in real networks and in the theory
describing them. Reciprocity also seems to be the most stable network measure one
can find in the ensemble of Wikipedias, except possibly the in-degree distribution
exponent. It was also shown that the reciprocity of Wikipedias cannot be explained
by random mixing of arcs. In this paper we show in which manner reciprocal arcs
influence the observed Wikipedia structure, and that they are a necessary
ingredient for understanding Wikipedia growth and organization.
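
The definition of reciprocity given above can be made concrete with a minimal sketch. The function name `reciprocity` and the representation of the network as a list of `(source, target)` pairs are illustrative assumptions, not part of the original analysis; self-loops and duplicate arcs are discarded here, which is also an assumption.

```python
def reciprocity(arcs):
    """Fraction of arcs (i, j) whose reverse arc (j, i) is also present."""
    # Assumption: self-loops and repeated arcs are ignored.
    arc_set = {(i, j) for i, j in arcs if i != j}
    if not arc_set:
        return 0.0
    # Every arc of a mutual pair counts as one reciprocal arc.
    reciprocal = sum((j, i) in arc_set for i, j in arc_set)
    return reciprocal / len(arc_set)


# Hypothetical three-article example: A and B link to each other, B links to C.
example = [("A", "B"), ("B", "A"), ("B", "C")]
print(reciprocity(example))
```

In this toy network two of the three arcs belong to a mutual pair, so the reciprocity is two thirds.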



II. RECIPROCITY IN WIKIPEDIA 



In order to understand how reciprocal arcs affect the structure of Wikipedia, we
first need to examine whether they exhibit any peculiar behavior at all. In the
case of Wikipedia, one can expect reciprocal arcs to exist between articles that
share a certain portion of content. If that were true, then the first assumption
should be that the reciprocal arcs are distributed over the underlying Wikipedia
network according to the mutual similarity of different articles. In other words,
the content similarity of two articles is supposed to be independent of the
degrees of those two articles. One way to study this independence is laid out in
the companion paper [16]. There we show that the independence of reciprocal arcs
can be studied using the inverse matrix of the process of random addition of
reciprocal arcs. In particular, we can use the equation

$$\langle S(0) \rangle = T^{-1}(p)\, S'(p) \qquad (1)$$

to study such a process. In Eq. (1), $S'(p)$ represents the vector of product
moments of degrees observed in the real network, $p$ represents the fraction of
unidirectional arcs that were transformed to bidirectional, $T^{-1}(p)$ is the
inverse of the transformation matrix, and $\langle S(0) \rangle$ represents the
expected vector of product moments of degrees in the network without reciprocal
arcs. In Fig. 1 we show that some types of correlations indicate that the
assumption of degree-independent reciprocal arcs cannot be justified in Wikipedia
networks. Degree-independent reciprocal arcs would require that the parameter $p$
be no larger than $\approx 0.07$, whereas from the data we know that $p$ should be
around $0.16$.

This analysis is based on the very strong assumption of network stationarity.
However, Wikipedia networks grow as many different Wikipedians edit many articles,
so it is necessary to investigate the influence of reciprocal arcs on the growth
of Wikipedia. We have previously shown that different Wikipedias grow in a very
similar fashion and that the number of newly added arcs is not linear with respect
to the number of vertices. Nevertheless, the observed behavior
$L \sim N^{1.14}$ is close enough to linear that we can approximate it by linear
growth.
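
As an illustration of the degree-independent null hypothesis tested in this section, the sketch below adds reciprocal arcs to a static network uniformly at random. It is one possible reading of "random addition of reciprocal arcs", assuming each existing arc is reciprocated independently with probability `p`; the function name and the toy network are hypothetical, and the transformation matrix of Eq. (1) is not implemented here.

```python
import random


def add_random_reciprocal_arcs(arcs, p, seed=0):
    """Return a copy of `arcs` where each arc gains its reverse with probability p.

    Assumption: reciprocation is independent of the degrees of the end vertices,
    which is exactly the hypothesis the section argues against.
    """
    rng = random.Random(seed)
    result = set(arcs)
    for i, j in arcs:
        if rng.random() < p:
            result.add((j, i))
    return result


# Toy static network without reciprocal arcs: vertex i points to the next 5 vertices.
base = {(i, j) for i in range(200) for j in range(i + 1, min(i + 6, 200))}
mixed = add_random_reciprocal_arcs(base, p=0.16)
print(len(base), len(mixed))
```

Under this process the number of arcs grows by a factor of about 1 + p on average, and the expected in-degree of a vertex increases by p times its original out-degree.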



III. MODEL 

The model we use to describe the growth of Wikipedias is studied in detail in
[16], and here we just summarize the idea of the model and list the most important
analytical results. The studied model was inspired by our findings in the
Wikipedia networks. Other authors studied the growth of Wikipedia networks with a
focus on preferential attachment and found a linear-like relationship between the
in-degree of a vertex and its probability of acquiring a new arc, at least for
small and medium vertex degrees. It is also known that a significant portion of
new arcs forms between old vertices in the network. Nevertheless, this very often
happens between vertices which are "young" compared to the age of the network.
This leads us to believe that ignoring the additional formation of arcs between
older vertices is a reasonable approximation for the growth of a Wikipedia-like
network.

Figure 1: The change of the expected initial correlations of one-vertex degrees
$\langle k_i k_o \rangle$, calculated from the inverse of the transformation
matrix $T^{-1}(p)$, for three different Wikipedias. The expected values of the
monitored correlations change sign at a parameter value of about $p \approx 0.07$.
Since the product-moment correlations are strictly positive, this behavior
indicates that only a small fraction of the reciprocal arcs are degree
independent. In the case of Wikipedias the maximal value of the parameter $p$ is
about $p \approx 0.16$.

The model consists of two steps. In the first step a new vertex, introduced into
the network at time $t$ and therefore labeled $t$, attaches to the network with
$m$ outgoing arcs. The probability that a given one of these $m$ arcs attaches to
some vertex $s < t$ is proportional to the in-degree $k_i(s)$ of the vertex $s$.
In the second step, for every new arc a reciprocal arc from $s$ to $t$ is formed
with probability $r$. We showed that for such a model it is possible to find the
exact joint degree probability distribution $P(k_i, k_o)$ of a single vertex. For
example, in the case $m = 1$ the distribution has the form



$$P(k_i, k_o) = \Theta(k_i - k_o) \binom{k_i - 1}{k_o - 1}\, r^{k_o} (1 - r)^{k_i - k_o}\, \frac{1 + r}{2 + r}\, \frac{(k_i - 1)!}{(r + 3)_{k_i - 1}} \qquad (2)$$



where $\Theta(x)$ is the usual Heaviside function and $(r + 3)_{k_i - 1}$
represents the Pochhammer symbol [18]. The asymptotic behavior of the marginal
in-degree distribution for the described model is of the form

$$P(k_i) \sim k_i^{-(2 + r)}. \qquad (3)$$

This solution nicely interpolates between the directed and undirected cases of the
BA model [19, 20]. Furthermore, the asymptotic behavior of the in-degree
distribution given in Eq. (3) is also valid for any $m$, i.e. the power-law
exponent does not depend on $m$. In [2] we reported an exponent of the in-degree
distribution of around $\gamma \approx 2.18$ and values of the reciprocity
coefficient (the fraction of reciprocal arcs) of around $r_e \approx 0.35$.
The described model for $m = 1$ predicts the relations $r_e = 2r/(1 + r)$ and
$\gamma = 2 + r$, which explain the observed empirical values of $r_e$ and
$\gamma$.
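
The two relations quoted above can be checked numerically with a short sketch. All function names are illustrative assumptions. The first part inverts $r_e = 2r/(1 + r)$ for the reported $r_e \approx 0.35$ and evaluates $\gamma = 2 + r$. The second part estimates the local log-log slope of the factor $(k_i - 1)!/(r + 3)_{k_i - 1}$, written through Gamma functions, to see the exponent $2 + r$ emerge.

```python
import math


def reciprocity_from_r(r):
    """Reciprocity coefficient predicted by the m = 1 model: r_e = 2r / (1 + r)."""
    return 2.0 * r / (1.0 + r)


def r_from_reciprocity(r_e):
    """Inverse of the relation above: r = r_e / (2 - r_e)."""
    return r_e / (2.0 - r_e)


def exponent(r):
    """Predicted in-degree power-law exponent, gamma = 2 + r."""
    return 2.0 + r


def log_tail_factor(k, r):
    """log[(k-1)! / (r+3)_{k-1}] = log[Gamma(k) Gamma(r+3) / Gamma(k+r+2)]."""
    return math.lgamma(k) + math.lgamma(r + 3.0) - math.lgamma(k + r + 2.0)


def local_slope(k1, k2, r):
    """Finite-difference slope of the tail factor on a log-log scale."""
    dy = log_tail_factor(k2, r) - log_tail_factor(k1, r)
    return dy / (math.log(k2) - math.log(k1))


r = r_from_reciprocity(0.35)
print(round(r, 3), round(exponent(r), 3))
print(round(local_slope(1000, 2000, r), 3))
```

For $r_e = 0.35$ this gives $r \approx 0.21$ and a predicted exponent of about 2.21, to be compared with the measured value of about 2.18.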



The three parameters which define the model are: $t$, the size of the modeled
network; $m$, the number of outgoing arcs of the new vertex; and $r$, the
probability of accompanying a new arc with its reciprocal arc. In order to
validate the model, we fixed three measured parameters of Wikipedia networks which
uniquely determine the degree distributions obtained in the model. First, the
number of vertices in the monitored Wikipedia must be the same as the final size
of the model network, i.e. $N_{\mathrm{model}} = N_{\mathrm{Wikipedia}}$. In this
way it is possible to check whether the model also captures the details of the
distribution in the tail as well as the power-law exponent. The second parameter
is the number of arcs in the modeled Wikipedia. The expected number of arcs
obtained in the ensemble of model realizations has to be the same as the number of
arcs measured in the modeled Wikipedia, i.e.
$E(L_{\mathrm{model}}) = L_{\mathrm{Wikipedia}}$. The third parameter is the
number of reciprocal arcs,
$L^{\leftrightarrow}_{\mathrm{Wikipedia}} = E(L^{\leftrightarrow})$. The last two
empirical parameters depend on our model parameters as

$$L_{\mathrm{Wikipedia}} = E(L_{\mathrm{model}}) = t m (1 + r) \qquad (4)$$

and

$$L^{\leftrightarrow}_{\mathrm{Wikipedia}} = E(L^{\leftrightarrow}) = 2 t m r. \qquad (5)$$

From these equations it is easy to express our model parameters as functions of
measured quantities:

$$r = \frac{L^{\leftrightarrow}_{\mathrm{Wikipedia}}}{2 L_{\mathrm{Wikipedia}} - L^{\leftrightarrow}_{\mathrm{Wikipedia}}}, \qquad m = \frac{2 L_{\mathrm{Wikipedia}} - L^{\leftrightarrow}_{\mathrm{Wikipedia}}}{2 N_{\mathrm{Wikipedia}}}. \qquad (6)$$
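
The inversion of the measured quantities into model parameters can be sketched as follows. The sketch assumes the relations $L = t m (1 + r)$ for the expected number of arcs and $L^{\leftrightarrow} = 2 t m r$ for the expected number of reciprocal arcs, with $t$ set to the number of vertices; the function name and the synthetic counts in the usage example are hypothetical and are not measured Wikipedia data.

```python
def model_parameters(n_vertices, n_arcs, n_reciprocal_arcs):
    """Solve L = t*m*(1 + r) and L_recip = 2*t*m*r for (t, m, r)."""
    t = n_vertices
    # 2L - L_recip equals 2*t*m, which isolates m and then r.
    two_tm = 2 * n_arcs - n_reciprocal_arcs
    m = two_tm / (2 * t)
    r = n_reciprocal_arcs / two_tm
    return t, m, r


# Synthetic counts generated from t = 1000, m = 10, r = 0.25 (not real data).
t, m, r = model_parameters(n_vertices=1000, n_arcs=12500, n_reciprocal_arcs=5000)
print(t, m, r)
```

Note that $m$ obtained this way is in general not an integer, since it is a ratio of measured counts.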



The parameter $m$ obtained from the measured quantities is not necessarily a
natural number, as is assumed in our analytical treatment [16]. To overcome this
inconvenience, at each time step we draw the number of outgoing arcs of the new
vertex from a Poisson distribution whose mean $E(m)$ equals the measured value of
$m$. We have shown that such a model has properties almost identical to those of
our model with $m$ a natural number if a suitable $E(m)$ is chosen [16].
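
A minimal simulation sketch of the growth process with a Poisson-distributed number of outgoing arcs is given below. Several details are assumptions not fixed by the text: the seed network (two vertices joined by a mutual pair of arcs), the possibility of choosing the same target more than once, and the use of Knuth's multiplication method for Poisson sampling. The names `poisson` and `grow` are illustrative.

```python
import math
import random


def poisson(rng, mean):
    """Sample a Poisson variate with Knuth's method (fine for moderate means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def grow(t, mean_m, r, seed=0):
    """Grow a network to t vertices; return the in- and out-degree lists."""
    rng = random.Random(seed)
    # Assumed seed network: vertices 0 and 1 joined by a reciprocal pair of arcs.
    k_in, k_out = [1, 1], [1, 1]
    # One entry per arc head, so a uniform draw is an in-degree-proportional draw.
    heads = [0, 1]
    for new in range(2, t):
        k_in.append(0)
        k_out.append(0)
        # Targets are chosen among older vertices before any arc of this step is added.
        targets = [rng.choice(heads) for _ in range(poisson(rng, mean_m))]
        for s in targets:
            k_out[new] += 1
            k_in[s] += 1
            heads.append(s)
            if rng.random() < r:  # reciprocal arc from s back to the new vertex
                k_out[s] += 1
                k_in[new] += 1
                heads.append(new)
    return k_in, k_out


k_in, k_out = grow(t=2000, mean_m=5.0, r=0.2, seed=1)
print(sum(k_in), 2000 * 5.0 * 1.2)
```

The total number of arcs in a realization should fluctuate around $t m (1 + r)$, which gives a quick consistency check of the implementation.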
IV. RESULTS

In Fig. 2 we show excellent agreement between the in-degree distribution of the
Japanese Wikipedia and our model. It is clear that the mode of the distribution is
also well described by our model, a feature not so common in other degree
distribution models found in the literature. We have already mentioned that small
and medium degrees show a kind of preferential attachment. For this reason it is
important that our model describes well the mode, which is formed by the vertices
of relatively small in-degree. The tail of the distribution is also very well
described by our model. This is important because such a tail was found to be a
universal feature of Wikipedias in different languages.

If we compare the cumulative in-degree distribution of the model to that of
Wikipedias (see Fig. 3), it is again easy to see that our model shows very good
agreement in the tail of the distribution. One can also notice that our model
deviates from the monitored Wikipedia at the very end of the distribution. This is
not surprising, because linearity of the attachment was observed only for small
and medium degrees. The largest degrees show an "aging" effect, i.e. they rarely
attract new arcs because their neighborhood has already matured in its content.
We did not model such behavior in order to keep the model as simple as possible.
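
A cumulative degree distribution of the kind compared here can be computed from a list of vertex degrees as sketched below. The sketch assumes the complementary form $P(K \ge k)$, which is a common convention for heavy-tailed data but is not stated explicitly in the text; the function name is illustrative.

```python
from collections import Counter


def cumulative_distribution(degrees):
    """Return sorted (k, P(K >= k)) pairs for a list of vertex degrees."""
    n = len(degrees)
    counts = Counter(degrees)
    remaining = n  # number of vertices with degree >= current k
    points = []
    for k in sorted(counts):
        points.append((k, remaining / n))
        remaining -= counts[k]
    return points


# Hypothetical in-degrees of four vertices.
print(cumulative_distribution([1, 1, 2, 3]))
```

Plotting these pairs on log-log axes for both the model and the measured network gives a comparison in which the noisy tail is smoothed by the accumulation.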

Our model does not reproduce the out-degree distribution well (see Fig. 4). The
mode of the modeled distribution is shifted too far to the right and is too narrow
in comparison with the real out-degree distributions. This could be a consequence
of using the Poisson distribution for the parameter $m$, which is too narrow in
this case. We chose it simply because it is the most easily justifiable
one-parameter distribution for that case. Clearly, we could get a much better
result with broader distributions for the parameter $m$, but such a choice would
be hard to justify and would be introduced just for fitting purposes. Since the
aim of this paper is to clarify the fundamental role of reciprocal arcs in the
structure and growth of the Wikipedia network, we focus on the version of the
model which requires no fitting procedures.

Figure 2: Comparison of the in-degree distribution of the Japanese Wikipedia with
one realization of our model for the given parameters. It is easy to see excellent
agreement both in the mode of the distribution and in the exponent (the slope in
the log-log plot). The chosen parameters are t = 94094, m = 16.75, r = 0.18. Our
simulations show very similar behavior for the rest of the studied Wikipedias.

Figure 3: Comparison of the cumulative in-degree distribution of the English
Wikipedia (blue circles) with one realization of our model (red line). The chosen
parameters are t = 486291, m = 18.24, r = 0.15. The distribution of the model
closely follows the distribution of the real network except at the very end of the
tail, where we expect aging effects in the real network.



V. CONCLUSION 

The excellent agreement between the Wikipedia and model in-degree distributions
confirms that our model is a natural continuation of the process of preferential
attachment, at least for the process of Wikipedia growth. Whether including the
additional formation of new arcs between old vertices would improve the agreement
is debatable. Heuristics for additional changes to the model are not easy to
justify without further exploration of Wikipedia networks.

Figure 4: Comparison of the out-degree distribution of the Japanese Wikipedia
(blue circles) with one realization of our model (red squares). The chosen
parameters are t = 94094, m = 16.75, r = 0.18. There is no similarity between the
modes of the two distributions, and the slopes of the tails seem to coincide at
the very end of the distribution.

It can be asserted that the presented logic of Wikipedia growth can be attributed
to the Wikipedians who are editing both old and new articles in a very small time
frame. In such a case the reciprocity is also a good measure of the information
interrelatedness in knowledge networks. Clearly, the existence of reciprocal arcs
points to a certain intersection of the sets of information presented in different
articles. Since reciprocity represents only the first viable correlation for such
information sharing, even better results could be expected if the model conserved
similar measures such as the triad significance profile [21] or some other local
structural motifs. Taking into account the neighborhood of an article as a pool of
more probable information sharing could also improve the quality of the model. The
problem with such attempts is the increase in the number of parameters which such
models would require.

The usefulness of our model for networks of different origin is presently not
clear. We feel that we have demonstrated the significant value of this model for
understanding Wikipedia networks, and we believe that it could also be important
for other types of knowledge networks with time-dependent formation of arcs. Since
reciprocity is a natural representation of feedback, the presented model and its
extensions could also be useful in the study of complex systems in which feedback
plays an important role. An effort in this direction is a logical continuation of
this research.